[Serve][2/N] Implement `AcceleratorConfig` to enable custom scheduling logic for accelerators with Serve deployments by ryanaoleary · Pull Request #63179 · ray-project/ray

ryanaoleary · 2026-05-07T02:53:03Z

Description

This PR introduces a structured AcceleratorConfig (starting with TPUAcceleratorConfig) for Ray Serve deployments to support advanced accelerator provisioning. Deployments with accelerator_config set use a per-replica PG creation path that dispatches to slice_placement_group for TPU. Gang scheduling is bypassed for these deployments - SlicePlacementGroup is itself a gang-scheduling primitive, so layering Gang PG on top would solve the same problem twice.

Specific Changes:

API & Configuration

Added AcceleratorConfig and TPUAcceleratorConfig Pydantic models defining hardware requirements (topology, version, chips per VM).
Added bytes accelerator_config to serve.proto and threaded through ReplicaSchedulingRequest and CreatePlacementGroupRequest.
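
For readers following along, here is a minimal sketch of what a config of this shape could look like. It is not the code from this PR: it uses stdlib dataclasses instead of Pydantic so it runs without dependencies, the field names (`topology`, `version`, `chips_per_vm`) are guesses based on the description above, and the `num_hosts` derivation (total chips divided by chips per VM) is an assumption.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class AcceleratorConfig:
    """Base type; each subclass describes one accelerator family."""


@dataclass(frozen=True)
class TPUAcceleratorConfig(AcceleratorConfig):
    # All field names here are hypothetical, not taken from the PR diff.
    topology: str      # e.g. "4x4"
    version: str       # e.g. "v6e"
    chips_per_vm: int  # chips attached to each host VM

    def __post_init__(self) -> None:
        dims = self.topology.split("x")
        if not all(d.isdigit() and int(d) > 0 for d in dims):
            raise ValueError(f"invalid topology: {self.topology!r}")
        if self.chips_per_vm <= 0:
            raise ValueError("chips_per_vm must be positive")
        if self.num_chips % self.chips_per_vm != 0:
            raise ValueError("chips must divide evenly across VMs")

    @property
    def num_chips(self) -> int:
        # Total chips is the product of the topology dimensions.
        return prod(int(d) for d in self.topology.split("x"))

    @property
    def num_hosts(self) -> int:
        # Assumed relationship: one worker bundle per host VM.
        return self.num_chips // self.chips_per_vm
```

Under these assumptions, `TPUAcceleratorConfig(topology="4x4", version="v6e", chips_per_vm=4)` describes a 16-chip slice spread over 4 hosts.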

Per-Replica PG Creation

Added ReplicaPlacementGroup wrapper delegating shutdown() and release_head_pgs() to the underlying TPU-specific PG.
Added _create_replica_placement_group as the internal scheduler entry point; dispatches on accelerator_config and wraps the result. _default_create_placement_group's public signature is unchanged, so external create_placement_group_fn_override users keep working.
Deployment-state cleanup calls ReplicaPlacementGroup.shutdown() on teardown and release_reservation_holders() after worker PG readiness.
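
A rough sketch of the wrapper-and-dispatch shape described above, using stand-in objects rather than real Ray placement groups. `FakeSlicePG`, the constructor arguments, and the two factory callables are hypothetical; only the names `ReplicaPlacementGroup`, `shutdown()`, and `release_head_pgs()` come from the description.

```python
from typing import Any, Callable, Optional


class FakeSlicePG:
    """Hypothetical stand-in for a TPU slice placement group."""

    def __init__(self) -> None:
        self.calls = []

    def shutdown(self) -> None:
        self.calls.append("shutdown")

    def release_head_pgs(self) -> None:
        self.calls.append("release_head_pgs")


class ReplicaPlacementGroup:
    """Wraps the PG a replica runs in, plus an optional accelerator-specific PG."""

    def __init__(self, worker_pg: Any, accelerator_pg: Optional[Any] = None) -> None:
        self.worker_pg = worker_pg
        self._accelerator_pg = accelerator_pg

    def shutdown(self) -> None:
        # Delegate only when an accelerator-specific PG exists.
        if self._accelerator_pg is not None:
            self._accelerator_pg.shutdown()

    def release_head_pgs(self) -> None:
        if self._accelerator_pg is not None:
            self._accelerator_pg.release_head_pgs()


def create_replica_placement_group(
    accelerator_config: Optional[Any],
    default_create_fn: Callable[[], Any],
    slice_create_fn: Callable[[Any], Any],
) -> ReplicaPlacementGroup:
    """Dispatch on accelerator_config and always return the wrapper type."""
    if accelerator_config is None:
        # Default path: the existing creation function is called unchanged.
        return ReplicaPlacementGroup(worker_pg=default_create_fn())
    slice_pg = slice_create_fn(accelerator_config)
    return ReplicaPlacementGroup(worker_pg=slice_pg, accelerator_pg=slice_pg)
```

Because the default path calls the existing function as-is, an override with the old signature can be passed as `default_create_fn` without changes, which is the compatibility property the description calls out.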

Edit: I scoped this PR way down to not include unrelated Gang PG changes - which can be in a separate PR.

Related issues

#57137

gemini-code-assist

Code Review

This pull request introduces a structured AcceleratorConfig for Ray Serve to support TPU slice reservations. It implements a new placement group management layer (_ReplicaPlacementGroup) that handles accelerator-specific lifecycle tasks, such as releasing head placement groups after scheduling. The changes span the Serve controller, deployment scheduler, and LLM engine configurations to enable per-host TPU bundle allocation. Review feedback highlights a critical runtime error where an invalid label_selector is passed to actor options, identifies missing logic for passing user-defined bundle label selectors, and notes a documentation mismatch in the TPU utility classes.

abrarsheikh · 2026-05-07T16:24:23Z

Please break up the PR, at least into the Serve parts first and then the LLM changes; that would make it easier to review.

… tests Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

…t `bundle_label_selector` Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2026-05-08T03:03:40Z

> please break up the PR, at least into serve parts first then the llm changes, would make it easier to review

Sounds good. I'm going to make this PR the Serve changes (it includes changes from #63171 for now so that tests work, but this should merge first). #63216 will include the changes from this PR, so that integration tests work, plus the LLM-specific changes.

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2026-05-20T22:17:34Z

Will fix merge conflicts once #63177 is merged.

ryanaoleary · 2026-05-20T23:35:17Z

The tpu.py change in this PR is from a separate PR that is already approved and just needs to be merged; this PR won't include that change.

jeffreywang-anyscale

Took a first pass -- could we break the PR down into the following?

Config / frontend only: ensure that AcceleratorConfig is plumbed through @serve.deployment, Deployment.options, DeploymentSchema (declarative YAMLs), and the protobuf surfaces. I think we're missing some plumbing in this PR.
Scheduler and state reconciliation

Tests haven't been reviewed.

jeffreywang-anyscale · 2026-05-21T03:39:30Z

Some other questions:

Lifecycle of `head_pg` and `worker_pg`

head_pg is claimed first to retrieve the slice name.
A worker_pg with num_hosts bundles is created using that slice name to claim the entire slice.
Once worker_pg has been created, the head_pg is released.

Question

At that point, is it possible for another replica to successfully claim head_pg, retrieve the same slice name, and then attempt to create its own worker_pg, only to discover that the slice is still occupied by the existing worker_pg?
What does the controller do in this case?
If the worker_pg's 0th bundle also asks for head_pg's bundle, could we avoid this race?
Taking a step back, can we tolerate this race or do we want to avoid it?

Co-authored-by: Jeffrey Wang <jeffreywang@anyscale.com> Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>

Signed-off-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>

…strings, change from Dev API to PublicAPI, and fix other comments Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

ryanaoleary · 2026-05-21T08:49:24Z

> Some other questions:
>
> Lifecycle of head_pg and worker_pg
>
> head_pg is claimed first to retrieve the slice name.
> A worker_pg with num_hosts bundles is created using that slice name to claim the entire slice.
> Once worker_pg has been created, the head_pg is released.
>
> Question
>
> At that point, is it possible for another replica to successfully claim head_pg, retrieve the same slice name, and then attempt to create its own worker_pg, only to discover that the slice is still occupied by the existing worker_pg?

Yeah, that's possible and is a valid issue. The current behavior would allow a slice PG call to discover a seemingly available TPU head, attempt to reserve it, and leave the worker PG hanging indefinitely.

> What does the controller do in this case?

The controller would just time out waiting for the PG to become ready.

> If the worker_pg's 0th bundle also asks for head_pg's bundle, could we avoid this race?

This wouldn't work because the two PGs would be in contention for the same resource; if we release the head_pg first, we risk a race with another slice claiming it.

> Taking a step back, can we tolerate this race or do we want to avoid it?

Yeah, I think we should fix this. The simplest solution is to go with what SlicePlacementGroup currently does by default: continue to hold the head_pg until shutdown is called, and then release both PGs at the same time. I'll refactor the logic so that we aren't releasing the head_pg early. Since both PGs are managed by the ReplicaPlacementGroup abstraction, this shouldn't add complexity to the code.

Should be fixed with 26ae3e2
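
To make the race concrete, here is a small in-memory model of the two policies discussed above. It does not use Ray or the code in 26ae3e2; `FakeCluster`, `schedule`, and the status strings are hypothetical, and a real pending placement group is simplified to a returned status.

```python
from typing import Dict, Iterable, Optional


class FakeCluster:
    """Each slice has one head resource and one block of worker hosts."""

    def __init__(self, slice_names: Iterable[str]) -> None:
        names = list(slice_names)
        self.head_owner: Dict[str, Optional[str]] = {n: None for n in names}
        self.worker_owner: Dict[str, Optional[str]] = {n: None for n in names}

    def claim_head(self, replica: str) -> Optional[str]:
        # Return the name of the first slice whose head is unclaimed.
        for name, owner in self.head_owner.items():
            if owner is None:
                self.head_owner[name] = replica
                return name
        return None

    def claim_workers(self, replica: str, slice_name: str) -> bool:
        if self.worker_owner[slice_name] is None:
            self.worker_owner[slice_name] = replica
            return True
        return False  # a real worker PG would stay pending here

    def release_head(self, slice_name: str) -> None:
        self.head_owner[slice_name] = None


def schedule(cluster: FakeCluster, replica: str, release_head_early: bool) -> str:
    slice_name = cluster.claim_head(replica)
    if slice_name is None:
        return "no_free_slice"
    if not cluster.claim_workers(replica, slice_name):
        return "worker_pg_pending"
    if release_head_early:
        # Old behavior: the head looks free again while workers are still held.
        cluster.release_head(slice_name)
    return "ready"
```

With a single slice and early release, a second replica claims the freed head and then gets stuck waiting on workers; when the head is held until shutdown, the second replica never sees the slice as available in the first place.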

… change default to SPREAD strategy Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 26ae3e2.

ryanaoleary · 2026-05-21T09:08:35Z

Addressed the outstanding design-related comments and bugs; will work on splitting this PR into two smaller ones.

ryanaoleary · 2026-05-21T19:57:49Z

Opened up #63581 to address #63179 (review), adding just the AcceleratorConfig, related classes, and the plumbing through the Serve proto and deployment, etc. That should be merged first and then this PR is just the scheduler change. cc: @jeffreywang-anyscale

jeffreywang-anyscale · 2026-05-21T19:58:30Z

> Opened up #63581 to address #63179 (review), adding just the AcceleratorConfig, related classes, and the plumbing through the Serve proto and deployment, etc. That should be merged first and then this PR is just the scheduler change. cc: @jeffreywang-anyscale

Nice thank you! Taking a look now.

ryanaoleary requested review from a team as code owners May 7, 2026 02:53

gemini-code-assist Bot reviewed May 7, 2026

ryanaoleary force-pushed the e-serve-accelerator-config branch from 6885e61 to aede10e May 7, 2026 02:56